Recursive integration of synergised graph representations of multi-omics data for cancer subtypes identification

Cancer subtypes identification is one of the critical steps toward advancing personalized anti-cancerous therapies. Accumulation of a massive amount of multi-platform omics data measured across the same set of samples provides an opportunity to look into this deadly disease from several views simultaneously. Few integrative clustering approaches are developed to capture shared information from all the views to identify cancer subtypes. However, they have certain limitations. The challenge here is identifying the most relevant feature space from each omic view and systematically integrating them. Both the steps should lead toward a global clustering solution with biological significance. In this respect, a novel multi-omics clustering algorithm named RISynG (Recursive Integration of Synergised Graph-representations) is presented in this study. RISynG represents each omic view as two representation matrices that are Gramian and Laplacian. A parameterised combination function is defined to obtain a synergy matrix from these representation matrices. Then a recursive multi-kernel approach is applied to integrate the most relevant, shared, and complementary information captured via the respective synergy matrices. At last, clustering is applied to the integrated subspace. RISynG is benchmarked on five multi-omics cancer datasets taken from The Cancer Genome Atlas. The experimental results demonstrate RISynG’s efficiency over the other approaches in this domain.

If x + iy and x − iy were two different eigenvectors of matrix G, then their inner product x 2 + y 2 would have been 0 because of the mutual orthogonality among the eigenvectors. That is not possible until x and y are 0, in which case, (3) and (4) would be indifferent. That is possible only if the initial assumption is contradicted and b is allowed to be 0 for all eigenvectors x. Hence, it is proved that all the eigenvalues of G are real.

Property 2 G is symmetric and positive semi-definite matrix.
Proof Pertaining to the fact that v i ∈ R d , the following should hold for some set of vectors x.
According to the elementary property of inner products, �x + y, x + y� = �x, x� + �x, y� + �y, x� + �y, y� . It implies that the sum of inner products in (5) can be taken forward as, Therefore, G is positive semi-definite matrix.
The previously described premise is often used in various methods of dimensionality reduction. Algorithms like Principal Component Analysis and its variants utilize kernel trick to map the observations into a higher dimension to make the data linearly separable. It is equivalent to projecting the mean-centered data onto a subspace on which its variance is maximum 29 . It is shown by Bernhard Scholkopf et al. 30 that algorithms like KPCA use a kernel function κ to essentially learn a mapping function φ for the input space R n into a high-dimensional Hilbert space F , which can be called as feature space. The process is demonstrated in (8) and (9). Therefore, for a data point v = (x 1 , . . . , x n ), x i ∈ R d , mapping into a feature space R n+k will be given by where, the value of p i depends upon the kernel that has been used for the mapping; however, kernels do not explicitly project the data into that high dimensional feature space; rather, it generates a Gramian matrix G of the mapped data in the aforementioned feature space F . Generated Gramian matrix enables the input data to be operated in that high-dimensional feature space 31 . If X = (x 1 . . . x n ), x i ∈ R d represent the input data. The corresponding Gramian matrix is given by Let G = U U T represent the eigen decomposition of G, where U is a matrix containing the eigenvectors of matrix G, arranged column-wise in descending order of their corresponding eigenvalues, which are present in the same fashion in the diagonal matrix as shown in (11) and (12).
Here, 1 ≥ · · · ≥ n ≥ 0 (see Property 3 of Gramian matrix), u T i u i = 1 for i ∈ {1, 2, . . . , n} and Gu i = i u i . Also note that in context of PCA Principal Components refers to the projection of the input data points onto the principal direction where the variance of the data is maximum. For PCA, the projection is given by y i = U T k x i for all i ∈ {1, 2, . . . , n} , where U k is a matrix of first k eigenvectors of G. However, in case of KPCA, the spectrum of G itself gives the projection of X 32 . Note that when φ(v) = v , Gramian matrix transforms into covariance matrix. Generalising both, if U k represent k principal axes, the algorithm finds a basis of an optimal low-dimensional subspace where the L 2 -norm of reconstruction error is minimum 33 . That is, for a test sample x In addition to dimensionality reduction, principal component analysis can also be used for k-clustering using a heuristic based k-means algorithm. This is done by performing k-means clustering in the projected space, as shown in heuristic k-means algorithm described in 34 . Graph Laplacian. Any set of observations appear to have an emergent behaviour to evince the properties of a graph when operated in a clustering pipeline. Therefore, given a set of data points X = (x 1 , x 2 , . . . , x n ) ∈ R d×n and a notion of similarity between any two points x i ,x j ∈ X , an undirected similarity graph S = (V , E) can be constructed out of them such that each vertex v i ∈ V represent a data point x i , and (v i , v j ) ∈ E represent the edge between vertices v i and v j . With each edge, there is an associated edge weight e ij that represent the similarity between the corresponding data points. Let the similarity matrix be W(i, j) = [e ij ] n×n . The degree d(v i ) associated with each node v i is given by The degrees of all the nodes/vertices can be wrapped in matrix form as shown in (15) These matrices act as a precursor for constructing a matrix of algebraic importance, called Laplacian matrix. The data can be composed as a discrete graph form by making graph Laplacian of its continuous representations like vector space or Riemannian manifolds. Laplacian matrix has many variants, so much so, that depending on the problem and available data, authors device their own version of graph Laplacian matrix 35 . The simplest graph (7) x T Gx = x T x ≥ 0. (8) φ : R n → F. (9) φ(v) = (x 1 , . . . , x n , p 1 , . . . , p k ) ∈ R n+k , (11) U =[u 1 , . . . , u n ], (12) � =diag( 1 , 2 , . . . , n ).  www.nature.com/scientificreports/ Laplacian, is given by ( D − W ). It is called unnormalise graph Laplacian matrix. However, in the proposed algorithm, the normalised graph Laplacian matrix has been used. That is, ) and I is the identity matrix of appropriate order. Considering the fact that similarity matrix is a Gramian matrix, it is apparent that Gramian and Laplacian are not much different. Laplacian can be characterised as the Gramian normalised over the degree matrix. The distinction between unnormalised and normalised graph Laplacian is better apparent in light of spectral clustering. Consider a strongly connected graph S = (V , E) . The purpose of clustering is to come up with the subsets of points according to their similarity, such that the similar points lie in the same subset. It is equivalent to finding the partitions of a graph such that the edge between different partitions has minimum weights. For two disjoint subsets A, B ⊂ V corresponding to two different partitions, the cut size is given by Let there be k clusters in the data. The aim of clustering is to find k such partitions A = (A 1 , A 2 , . . . , A k ) , such that the size of the cuts, as shown in (17), over all the partitions is minimum. That is where Ā i is the complement of A i . This is called the mincut problem. However, solving (18) alone does not achieve reliable clustering results. For example, for k = 2 , partitioning one vertex from the rest of the graph can also be a valid solution as per mincut. In clustering, each cluster needs to accommodate a reasonably large partition to be considered credible. Therefore, the objective function is redefined in following two ways where |A i | represent the number of vertices in partition A i and vol(A i ) = v j ∈A i d j .
However, solving these minimisation problems is NP hard. Laplacian matrix is an utility that can be used to approximate these minimisation problem. Consequently, unnormalised Laplacian serves in the approximation of the minimization of RatioCut, while normalised Laplacian serves in the approximation of the minimization of NCut. Therefore, the approximated objective function using normalised Laplacian is given by (21).
The above expression is minimum when U k ∈ R n×k is a matrix containing eigenvectors corresponding to k smallest non-zero eigenvalues of matrix L . This matrix is used to embed the data into a k dimensional euclidean space spanned by the vectors in matrix U, in which grouping of the data points is arguably easy even with simpler techniques like k-means. The described practice is known as Laplacian embedding. The embedded data is then subjected to k-means clustering algorithm for cluster discovery, as shown in Normalised Spectral Clustering presented in Ref. 36 .For a strongly connected graph with single component, the eigenvector corresponding to the trivial solution (i.e. = 0 ) of the eigenvalue problem of matrix L is a column vector of n ones. Therefore, L 1 n = 0 where 1 n = (1, . . . , 1) T . If the graph happens to have more than one components, then the multiplicity k of eigenvalue 0 if equal to the number of connected components in the graph. Nonetheless, with respect to clustering, the eigenvector(s) corresponding to eigenvalue 0 should be omitted while performing Laplacian embedding. It can be done by introducing a minor change in the matrix.
If the eigenpairs of L are given by then, the eigenpairs of (22) are given by Hence, the new eigenvalue problem becomes where, 0 = 1 < 2 · · · ≤ n ≤ 2 and f 1 = 1 n .

RISynG algorithm.
For grouping the cancer patients into clusters, each omic view is represented as a graph using two representation matrices, that is the Gramian matrix and the Laplacian matrix. Each of the representation matrices attributes the similarity network of the samples with a notion of similarity between the samples. Consider a view X m = (x 1 , x 2 , . . . , x n ) , x i ∈ R d m corresponding to mth omic-source. If ρ(x i , x j ) denotes the distance between x i and x j ∈ X m , then the similarity w(x i , x j ) between them is given by: where σ is a free parameter adjusted as per the intrinsic properties of the data when subjected to clustering model. For the cancer data used in this study, the σ is given by σ = max( for all x i , x j ∈ X m . It has been assumed in the proposed method that multi-views may constitute different cluster manifolds when learnt on a particular similarity measure. Therefore, predicted clusters would be apparent, and in strong concordance with the clinical clusters if pairwise sample similarity is computed in data-dependent multi-kernel approach. It was found that in some views correlation distance was prominently reflecting cluster manifold that concurred with the natural clusters, while some of them showed proclivity towards Euclidean distance, and the rest seemed to accommodate parts of both. All things considered, two different graph representation matrices have been formulated, Gramian matrix and Laplacian matrix, both with different measures of similarity. Let for X m , the correlation distance (24) L1 n = L 1 n + 2 n (1 n 1 T n )1 n = 1 1 n + 21 n = ( 1 + 2)1 n . www.nature.com/scientificreports/ between x i and x j be given by ϕ m (x i , x j ) and the squared Euclidean distance be given by ε m (x i , x j ) . If φ m and ε m denotes the maximum pairwise correlation distance and squared Euclidean distance respectively, then Gramian matrix G m and similarity matrix W m are given by The matrix articulated in (28) is a crucial precursor for the construction of Laplacian matrix. Laplacian matrix is constructed by normalising W m by the degree matrix D m of its associated graph as in Eqs. (15) and (16). Hence, required representation matrices for each view X m , m ∈ {1, 2, . . . , M} are given by (27) and (29).
So obtained laplacian matrix is then modified as described in Eq. (22) It is apparent from the discussion presented under the heading Gramian Matrix and Kernel Trick and Graph laplacian that the matrix U k from Gramian matrix has the same role as that from Laplacian matrix. Therefore, for combining the information encoded in these matrices, a parameterised combination function (·, ·) can be used, hence obtaining a synergy matrix of representation matrices. If G m is the Gramian matrix and L m is the Laplacian matrix of omic-view X m , then the synergy matrix is given by: Consequently, the corresponding objective functions, (13) and (21) also combines to optimise over U k ∈ R n×k .
Some of the relevant properties of synergy matrix H m are:

Property 1 H m is symmetric and positive semi-definite matrix.
Proof H m can be called a positive semi-definite matrix if and only if v T H m v ≥ 0 for all v ∈ R n . Also, from the properties of the Graph Laplacian and the Gramian, it is evident that both L and G satisfies this condition. Therefore, In addition to that, since H m is a summation of symmetric matrices, it is also symmetric. Hence, it is proved that H m is a symmetric and positive semi-definite matrix.
Given Property 1, rest of the properties are its direct consequence. Recursive multi-kernel integration. After generating synergy matrices for all the views of the dataset, the next step is to integrate the information obtained from each of them. However, before moving to the integration step, the proposed approach needs these matrices to be arranged based on their relative relevance for cluster discovery. It is apparent that the better views would encode the cluster structure better. As a consequence of that, they would depict better cluster validity indices as well. Therefore Next, a method for combination has been proposed which distills the cluster information from each of the synergy matrix one by one, in an iterative fashion. While doing that, it subtly takes care of enriching the information coming from the relevant matrices. The way that the synergy matrices has been made, it is apparent that it is their basis of the eigenspace that brings out the latent cluster structure in the corresponding view. Therefore, the proposed method uses a recursive function to exploit this fact for integration as well as enrichment of the relevant views of the dataset. The recursive formula can be written as: www.nature.com/scientificreports/ Here k η is called accretive matrix of η th recursive step. Non-cumulative operator ⊗ signifies the integration operation. That is, for A ∈ R n×n and U ∈ R n×k , where A has its k smallest eigenvectors in V ∈ R n×k , and U is a basis matrix, the expression A ⊗ U evaluates to an accretive matrix A ′ ∈ R n×n with k smallest eigenvectors given by V + U . Other eigenvectors of A are irrelevant for this discussion. Let the basis of eigenspace of A ′ be known as accretive basis and associated subspace as accretive subspace. Also, let the accretive basis corresponding to k smallest eigenvectors of k η be given by b η . In extension to that, for enriching relatively relevant views, the proposed method uses an orthogonalisingnormalising function N(·, ·) . To ensure the accumulation of only the essential cluster information, the proposed approach acquires the basis of that projection of synergy matrix eigenspace which is orthogonal to the accretive subspace at that recursive step. The idea is similar to eigenspace updation for integrative clustering as performed in Ref. 18 . This function does not normalise the synergy matrix per se, rather, it normalises the basis of the described projection subspace. The computation starts by instantiating k 1 = 1 H so that b η becomes 1 U k . Therefore, at ( η + 1)th recursive step ( η ∈ {0, 1, . . . , M} ), one should have accretive matrix k η and eigenspace basis (η+1) U k of synergy matrix (η+1) H . Subsequently, processing within orthogonalising-normalising function N(k η , (η+1) U k ) renders the final basis matrix in four steps: First, computing the basis P of the projection subspace, which is given by: Second, computing the residual component of the synergy matrix eigenspace Q which is given by subtracting the above-mentioned projected component from (η+1) U k as: In the third step, Q is subjected to Gram-Schmidt orthogonalisation to yield the final basis R . This basis cannot be integrated with the eigenspace of accretive matrix, therefore it needs to be normalised on the basis of its relevance. So, the fourth step of normalization is performed as: Here the notation [·] denotes that the subsequent operations are done in element-wise fashion. The resultant V matrix is called as orthogonalised-normalised basis matrix. After the end of the process, the final accretive matrix k M is obtained whose first k eigenvectors in the matrix b M ∈ R n×k holds the cluster structure. Hence, performing k-means on the rows of the matrix b M returns the cluster labels for each sample. The proposed algorithm is described in Algorithm 1.
(34) k η+1 := k η ⊗ N(k η , (η+1) U), where k 1 = 1 H and η = 1, . . . , M. . Let the number of iterations (regulated through parameter β ) to learn the synergy matrix's best composition in steps 12 to 16 be t β . However, it has been found that for the datasets used in this study, the value of t β = 10 suffices. Iterating β from 0 to 1 with an increment of 0.1 with each iteration can produce an optimal combination ratio for the representation matrices. However, here, the increment step has been referred to as α for consistency. Assuming t max be the highest iteration by the k-means clustering algorithm the complexity of the aforesaid steps becomes O(t β n 3 + t β t max nk 2 + t β n) . Where t β n 3 comes from the complexity of eigenvalue decomposition of synergy matrix, t β t max nk 2 is for the step where k-means clustering is performed, and t β n is for the f-measure calculation. Therefore, the complexity of steps formulated from 12 to 16 turns out to be bounded by O(t β n 3 ) . Steps 17 to 19 are doing the same processing as previously, just at the optimal value of β . Hence, they are also bounded by O(t β n 3 ) . Summing up all the steps from 9 to 20 for M views, the complexity of O(Mn 2 + Mn 3 + Mt β n 3 ) reduces to O(Mt β n 3 ) . Sorting can be done at O(MlogM). After that, an accretive basis is constructed as defined in the function INTEGRATE(b, η ).
Step 5 consists of the construction of P , Q and orthogonalized-normalized matrix V. In this step, two matrix multiplication operations are bounded under the complexity of O(n 2 k) . Gram-Schmidt orthogonalization and normalization step combined has a complexity of O(n 2 ) . Therefore, step 5 has a complexity of O(n 2 k) .
Step 6 is matrix addition with complexity O(nk), but step 5 seem to dominate over that. In addition to that, since the function runs (M − 1) times, the complexity from steps 21 to 23 becomes O(MlogM + Mn 2 k) = O(Mn 2 k) . After the construction of the accretive basis, k-means is performed, which, as explained previously, has time complexity O(t max nk 2 ) . Considering everything, the overall complexity of RISynG comes out to be O(Mt β n 3 + Mn 2 k + t max nk 2 ) = O(Mt β n 3 ).
Significance of proposed algorithm. There are some aspects of the proposed algorithm that enhance its performance and make it unique from the other algorithms designed to identify cancer subtypes. Although each omic-view in the cancer dataset has its distinct cluster structure, the knowledge of cancer biology suggests that no omics-source to which each view belongs can dictate the final cancer subtype alone. Instead, all the omics sources collectively manifest the cancer subtype in a sample. Therefore, multi-view integration is critical to a sensible and clinically relevant clustering. The proposed approach can be broken down into three operative steps: (1) construction of representation matrices for each view, (2) construction of synergy matrix for each view, and www.nature.com/scientificreports/ (3) construction of accretive basis through recursive multi-kernel integration of synergy matrices. These steps make the proposed algorithm more effective in the following manner: 1. Construction of representation matrices To group the cancer patients into clusters, each omic-view first has to be represented as similarity graphs. These similarity graphs can be interpreted through various representation matrices like the Gramian, Laplacian, and Adjacency. Each representation matrix attributes the samples' similarity network with a notion of similarity between the samples. The proposed method assumes that multiple information sources may constitute different cluster manifolds when learned on a particular similarity measure. Therefore, predicted clusters would be apparent and in strong concordance with the clinical clusters if pairwise sample similarity is computed in a data-dependent multi-kernel approach 37 . In some views, Correlation distance was prominently reflecting cluster manifold that concurred with the natural clusters. Whereas some of them showed proclivity towards Euclidean distance, the rest seemed to accommodate both. All things considered, two different graph representation matrices have been formulated, the Gramian matrix and Laplacian matrix, both with different measures of similarity. 2. Construction of synergy matrices Representation matrices so constructed have two noteworthy aspects: (1) G m represents a similarity graph formed using correlation-based distance. In the correlation-based distance, two objects are considered similar if the trends among their elements are highly correlated. That means the correlation distance between two perfectly correlated samples will be 0, even though they are far apart in the euclidean space of their dimension. It is instinctive to assume the omics data to behave like that.
(2) Laplacian, on the other hand, preserves the intrinsic manifold structure in the data casted on a low embedding space.
To integrate these representation matrices, a combination function has been devised that takes a convex combination of both the matrices. This method of combining matrices rectifies any bias created by the dissimilarity in distance measurement used while constructing the similarity graphs. The combination function defined in (31) utilises the parameter β ∈ [0, 1] to capture graphs constituted by the Gramian and Laplacian. Parameter β can only take a positive value, making the combination a convex combination of representation matrices. This parameter's optimal value is learnt by iterating it from 0 to 1 at some incremental step size α ∈ (0, 1) . The datasets used in this study tend to pick up the optimal value of β at a step size of α = 0.1 . It is crucial to choose the incremental step size wisely as the number of iterations t β is directly proportional to the algorithm's time complexity. Because the synergy matrix will ultimately affect the cluster assignment, the best way to evaluate the appropriate value of β is to perform a provisional cluster validity test on the synergy matrix constructed with that β using a cluster validity index like silhouette index. Algorithm-1, steps 15 to 19 formulate the described provisional cluster validity test using silhouette as a criterion. 3. Construction of accretive basis After the similarity between the cancer patients is captured in a refined form with the help of synergy matrices, the next step is to integrate them. Property 1 of the synergy matrix proves that H m is a positive semi-definite matrix. That makes the integration of synergy matrices a multi-kernel integration. The proposed algorithm does that by recursive multi-kernel integration by iteratively integrating each of the synergy matrices' relevant subspace. Here, relevant subspace refers to that subspace of the matrix that purely encodes the cluster information, which in the case of synergy matrix is its eigenspace corresponding to k eigenvalues. Finally, an accretive basis matrix is generated. This accretive matrix is required to have more cluster information coming from relevant views. Therefore, the orthogonalizing-normalizing function is made such that the accretive basis at each recursive step gets less influenced by the irrelevant matrix.

Description of datasets
For analysing the efficiency of the proposed algorithm for identifying cancer subtypes, it is applied to five cancer datasets taken from TCGA (https:// cance rgeno me. nih. gov/). The datasets used are Cervical cancer (CESC), Breast cancer (BRCA), Ovarian cancer (OV), Lower-grade glioma (LGG), and Stomach cancer (STAD). Different studies have identified 4 clinically important subtypes for BRCA 9 and STAD 38 , 3 for CESC 39 and LGG 40 and 2 for OV 41 . The cancer genome is neither simple nor independent but is complicated and dysregulated by multiple levels in the biological system through genomic, epigenomic, transcriptomic, proteomic levels 42 . miRNA, as one of the important regulators of gene expression, can be integrated with gene expression to identify the selective inhibition of translation or selective degradation [43][44][45] . Furthermore, in terms of epigenetic regulation, histone modification or DNA methylation can serve to regulate gene expression in cancer 46,47 . Also, protein expression data can be utilized for the diagnostic prognosis of cancer patients 48 . Therefore, four omic views, namely, gene expression (mRNA), microRNA expression (miRNA), DNA methylation (metDNA), and reverse-phase protein assays (RPPA), are utilized for CESC, BRCA, and LGG datasets. For STAD and OV datasets, mRNA and miRNA expression are only considered because metDNA and RPPA information are not available for most samples. To avoid involving features with too many missing values, more than 5% of missing values in all of the omic views are removed, and the rest of the missing values are replaced with 0. Sequence-based expression data are log-transformed to make the data more or less normally distributed 49 . Therefore the 0 entries from miRNA and mRNA expression data are replaced with 1 and then log-transformed with base 10. For metDNA datasets, beta values are considered. At last, variance filtering is applied to mRNA and metDNA omic views for all cancer datasets, and 2000 most variable genes and CpG locations were only considered.

Experimental results and discussion
The performance of the proposed approach is compared with eleven other algorithms available for cancer subtype identification. Both two-stage clustering approaches and integrative clustering approaches are considered for method comparison. The methods used for comparison are Similarity Network Fusion (SNF) 13  Performance analysis on multi-omics cancer datasets. The proposed approach and the abovedescribed methods are applied to five cancer datasets, namely CESC, BRCA, OV, LGG, and STAD, taken from TCGA. The sample clusters identified by these methods are evaluated based on several internal and external cluster evaluation indices. The cancer subtypes identified by these methods are also evaluated for their biological relevance. Next, the detailed comparative analysis of the proposed algorithm is discussed.
Cluster evaluation. The clusters (cancer subtypes) generated by all the methods are evaluated based on several internal and external cluster evaluation indices. These indices help get the idea of how well a method can group the samples into homogeneous clusters. Samples belonging to the same cluster should have higher similarity representing a cancer subtype, whereas samples belonging to different clusters should be highly dissimilar. How well an algorithm can capture the natural grouping present in the data can be quantified with internal validity indices. Following four internal evaluation indices are calculated in this study. Table 3, presents the internal evaluation indices for every method.
1. Silhouette Index: It measures the consistency present in the clusters. The value lies in the range [−1, 1] . A value nearer to + 1 indicates a higher distance between the clusters, a value of 0 indicates that the sample is very close boundary between two neighboring clusters, and a negative value indicates misclassification 57 .  CESC  124  2000  311  2000  219  3   BRCA  398  2000  278  2000  212  4   OV  474  2000  591  --2   LGG  267  2000  333  2000  209  3   STAD  223  2000 Here, a(i) = average dissimilarity of i th object to all other objects in the same cluster and b(i) = average dissimilarity of i th object with all objects in the closest cluster. 2. Dunn Index: A higher value represents better clustering solution 58 . It is defined as: Here, δ(C i , C j ) = distance between cluster C i and C j and �(C k ) = intra-cluster distance within cluster C k . 3. Davies-Bouldin Index: It is defined as the ratio of within cluster dispersion to between cluster dispersion 59 .
A lower value indicates better clustering.
Here, D i = max j� =i R i,j and R i,j = S i +S j M ij . M i,j is the separation between the ith and the jth cluster. S i and S j are the within cluster scatter for cluster i and j and C is the number of clusters. 4. Xie-Beni Index: The index for crisp clustering is estimated as: Here, 1 N WGSS represents the averaged-squared distance of all the points with respect to the barycenter of the cluster they belong to, and δ a measure of the between-cluster distance 60 . The class distribution of the cancer datasets used in this study is presented in Table 2. Except for the CESC dataset, all the other cancers have an imbalanced class. When clustering is applied to these datasets, there are chances that most of the samples get clustered into one group leading to good values for internal indices. Still, in reality, the clustering is not efficient. If the ground truth is available, the partitions created in such imbalanced data can be efficiently evaluated with external evaluation indices. In this study, five external evaluation indices are calculated to compare the clustering efficiency of the different algorithms. Considering a set of n objects X = {X 1 , X 2 , . . . , X n } , suppose C = {C 1 , C 2 , . . . , C R } represents a partition of X obtained by a clustering algorithm and K = {K 1 , K 2 , . . . , K C } represents the ground truth or the class information. A contingency table is created to look for the overlap between the clustering result and the ground truth, where n ij = |C i ∩ K j | is the common elements in cluster C i and class K j . n i is the number of elements in C i and n j is the number of elements in K j . The external indices are defined as: 1. F-measure (FM): The idea of precision and recall from information retrieval is merged to obtain FM. It disregards the unmatched portions of the clusters. It can attain values ranging between 0 and 1. A value nearer to 1 represents better clustering 61 . 3. Normalized Mutual Information (NMI): The inter-dependencies between cluster number and cluster quality can be quantified by NMI. It is estimated as: Here, I is mutual information and H is entropy. The value ranges from 0 to 1, value nearer to 1 means better clustering 63 . 4. Jaccard Index: It is used to measure the similarity between two sets, that are clustering solution, and the class information. It is defined as: Higher the value of this index better in the clustering. 5. Purity: For estimating Purity, the clusters are first allocated to that class which is present most frequently in the cluster. Later, the accuracy of this cluster-class allocation is obtained by dividing the number of correctly assigned objects to total number of objects 63 . The equation for calculating Purity is: Purity ranges from 0 to 1, a value closer to 1, better is the clustering. Based on these five external evaluation indices, it is observed that the proposed algorithm outperforms in CESC, BRCA, LGG, and STAD datasets. OV cancer is the only case where the proposed approach cannot work that well. Suppose all the datasets are considered together to rank the clustering efficiency of all the algorithms under study, considering all the external indices. In that case, the proposed method stands first by attaining a maximum value for 20 times out of 25. The execution times reported in Table 3 show that RISynG is faster than other algorithms.
Importance of multi-omics data integration. The proposed algorithm RISynG iteratively integrates the relevant subspace of each of the synergy matrices. The relevant subspace corresponds to the k largest eigenvectors of the synergy matrices that hold the cluster structure. To exhibit the significance of this iterative integration and the effectiveness of RISynG, it is compared with Spectral clustering performed on individual omics datasets. The results presented in Table 4 show that the proposed algorithm outperforms the individual omic-views in CESC, BRCA, LGG, and STAD datasets for all the external clusters validity indices. In the OV dataset, RISynG outperforms for F-measure, Jaccard, and Purity. However, the miRNA view performs better for ARI and NMI indices. The performance of RISynG is significantly higher than the best individual view in the case of CESC, BRCA, and LGG datasets, irrespective of any indices.
To express the cluster holding capacity of the integrated subspace obtained by the proposed approach, scatter plots for the best k dimensions are plotted. The colours in the plots indicate the ground truth (cancer subtypes). Comparative plots are also presented in Figs. 3, 4, 5, 6, and 7 to show that the integrated subspace obtained by RISynG are more informative than other subspace-based integrative-clustering approaches (SNF, SURE, CoALa, iCluster, WMLRR, and MiMIC), for most of the datasets. Comparison with the best individual omicview (CESC: mRNA, BRCA: metDNA, OV: miRNA, LGG: metDNA, and STAD:miRNA) is also presented to establish the significance of multi-omics data integration performed by the proposed approach. Considering the proposed approach, the scatter plots show that the clusters are well separated in the case of CESC (Fig. 3) and LGG (Fig. 6) datasets. There is a slight overlap between the two groups in BRCA (Fig. 4), but it is better than the other methods. Whereas, for OV (Fig. 5) and STAD (Fig. 7) datasets, the overlap between subtypes is observed in the subspace obtained by all the methods.
Biological analysis. Once the cancer subtypes are obtained, the patient clusters' molecular characteristic feature is also evaluated to establish their biological relevance. To understand the varying expression of different biomarkers in different subtypes, differential expression analysis (DEA) of miRNAs and mRNAs is performed between the correctly identified groups of patients. A comparative analysis is performed between the true posi-  Table 3. Comparative cluster analysis of proposed and existing approaches. The bold values indicate the best score as reported in the text. Biological enrichment analyses. The first analysis is Pathway enrichment analysis (PEA). It explores the mechanistic insight into the set of differentially expressed biomarkers. It helps identify those biological pathways enriched in a set of biomarkers more than expected by chance. The second one is Biological process enrichment analysis (BPEA). It helps characterize the relationship between genes or miRNAs by specifically annotating them to associated biological processes. It helps identify the over-represented biological processes in our list, which    www.nature.com/scientificreports/ can help evaluate the biological significance of the obtained cancer subtypes. Furthermore, the third one is Disease ontology enrichment analysis (DOEA). Disease Ontology (DO) helps map the relevance of cancer subtypes identified from high-throughput data to clinical relevance. In this study, the R package, clusterProfiler 65 and DIANA Tools mirPath v.3 66 are used for performing PEA and BPEA for genes and miRNAs, respectively, and R package DOSE 67 is used to perform DOEA for the genes. The top 100 differentially expressed biomarkers are passed to these tools. In some cases, if the number of differentially expressed biomarkers is less than 100, then all of them are used. KEGG database is selected for PEA 68 . All the pathway terms associated with the set of  www.nature.com/scientificreports/ biomarkers having false discovery rate adjusted p-value < 0.05 (significant pathway terms) are only considered. Suppose any differentially expressed biomarker sets are not associated with significant KEGG pathway terms. In that case, that set is said to be not biologically relevant with respect to KEGG pathway terms. Similarly, all the biological process (BP) terms associated with the set of biomarkers having a false discovery rate adjusted p-value < 0.05 (significant pathway terms) are only considered. If any of the differentially expressed biomarker sets are not associated with significant BP terms, that set is said to be not biologically relevant with respect to BP terms. In DOEA, semantic similarities between DO terms and genes are calculated that help explore the similarities of diseases and gene functions from a disease perspective. The output of DOES has associated disease terms. A gene set is said to be enriched with DO terms if the terms obtained by its DOEA have a false discovery rate corrected p-value < 0.05. For the quantification of KPEA, BPEA, and DOEA, respective enrichment scores 69 , and annotation ratios 69 are calculated. The higher the value of these scores better is the enrichment; hence, the more biologically significant the differentially expressed biomarkers are, the better the cancer sub-typing. Following are the equations for these scores: Here, T denotes the number of significant pathway/BP/terms associated with a set of differentially expressed genes or miRNAs between two cancer subtypes identified by any clustering approaches. G denotes the total number of genes given to clusterProfiler for the enrichment analysis, and g denotes the gene count associated with a pathway/BP/DO term. Comparative analysis of the cancer subtypes obtained by the proposed approach and other existing algorithms are performed and the associated quantitative indices are reported in Tables 5, 6, 7, 8, and 9. Some of the differentially expressed miRNAs or mRNAs have no associated significant terms; therefore, there is no scope for calculating the quantitative indices. Also, in some cases, there are no differentially expressed biomarkers. All these cases are represented by * in the tables.
To compare the effectiveness of the proposed approach with the other algorithms in this study, the overall performance of all the methods is also evaluated. When all the five cancer datasets are considered together, the proposed approach outperforms concerning both cluster evaluation indices and biological enrichment analysis, as shown in Fig. 8. The analysis is performed by considering the success frequency (number of times a method scored the highest value for respective indices when all the cases in all the cancer types are considered). The success frequency shows that the proposed approach outperforms when cluster validity indices are considered by scoring maximum values for 21 times, followed by SNF.CC (7), SNF (6), CNMF (5), CC (2), COCA (2), and www.nature.com/scientificreports/ WMLRR (1). Similarly, suppose the methods are ranked considering the success frequency for quantitative indices calculated for biological enrichment analysis. In that case, the proposed approach will again stand first by scoring the maximum value 67 times, followed by SNF (21), SNF.CC (20), CC (12), CoALa (10), CNMF (9), MiMIC (7), SURE (5), WMLFF (5), COCA (4), and iCluster (1). If the cluster validity indices are looked upon individually, the proposed approach also outperforms with respect to F-measure, ARI, NMI, Jaccard index, and Purity. Considering the indices for biological enrichment individually, the proposed algorithm again outperforms with respect to all the indices except for AR for BPES for mRNA enrichment, where it stands second.
Overlap analysis. The hundred most differentially expressed genes between all the subtypes-pairs in cervical cancer that RISynG and the other methods identified are explored further for experimental support. The genes are analyzed based on the degree of overlap with known cervical cancer genes that are experimentally validated.  Table 10. In total, 30 genes out of 222 identified from the proposed approach overlap with cervical cancer-related genes. This is the maximum overlap when compared with the other methods. Fisher's exact test is used here to find the statistical significance of the contingency table created from the overlap analysis in Table 10 for different algorithms. At 95% confidence, it is observed that only the genes identified by the proposed approach have significant overlap with experimentally validated genes curated from literature with a p-value of 0.026. Therefore, it indicates that the proposed approach has the potential to identify clinically important subtypes of cancer that have a characteristic molecular signature.     Table 9. Comparative biological analysis of STAD dataset. The bold values indicate the best score as reported in the text.

Conclusion
The present study describes a method named RISynG that efficiently identifies cancer subtypes. Cancer subtypes identification can facilitate cancer diagnosis and therapy. It is one of the vital components of the precision medicine framework. The main contributions of this study are: (1) Development of an integrative clustering method for multi-view omics data. (2) Demonstration of the effectiveness of the proposed method over other methods.
(3) Establishing biological relevance for the obtained results.